The Framingham Heart Study is a long-term prospective study of the etiology of cardiovascular disease among a population of free-living subjects in the community of Framingham, Massachusetts. The Framingham Heart Study was a landmark study in epidemiology in that it was the first prospective study of cardiovascular disease and identified the concept of risk factors and their joint effects FHS Longitudinal Data Document.
The dataset is composed by 4240 rows (observations) and 16 columns (variables), which are described below:
In this analysis, we will apply three regularized regression techniques—Lasso, Ridge, and Elastic Net—to predict glucose levels based on various health-related features. We will also use use regularized logistic regression to predict the ten year risk of of coronary heart decease. Finally we will build SVMs to explore the same questions. These methods are used to address potential overfitting in predictive models, especially when there are many predictor variables or multicollinearity among them. We will perform exploratory data analysis to find some key findings and questions that lie with the dataset. Then imputing missing variables using the MICE method will be used for featuring engineering and selecting the features to answer the two questions for linear and logistical modeling.
library(dplyr)
library(reshape2)
library(tidyr)
library(ggplot2)
library(tibble)
#Load the sample data
hearts = read.csv("https://jessgorr01.github.io/STA551/sta552/project2/FraminghamHeartStudy.csv")
head(hearts)
male age education currentSmoker cigsPerDay BPMeds prevalentStroke
1 1 39 4 0 0 0 0
2 0 46 2 0 0 0 0
3 1 48 1 1 20 0 0
4 0 61 3 1 30 0 0
5 0 46 3 1 23 0 0
6 0 43 2 0 0 0 0
prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose TenYearCHD
1 0 0 195 106.0 70 26.97 80 77 0
2 0 0 250 121.0 81 28.73 95 76 0
3 0 0 245 127.5 80 25.34 75 70 0
4 1 0 225 150.0 95 28.58 65 103 1
5 0 0 285 130.0 84 23.10 85 85 0
6 1 0 228 180.0 110 30.30 77 99 0
The first step in the analysis is looking at all the individual features. In order to perform this action, all of the variables the the dataset will be converted to factors. this can be seen in the code chunk below. Following this, all missing data will be identified and a summary will be given to get a general overview of the credit risk data. Then visuals will be given to looking at the distributions of individual factors, and what insights can be discovered for future predicting and analysis.
Below is a chunk to turn all the variables into factors.
library(dplyr)
#colnames(hearts)
# Convert all columns into factors
hearts_factorized <- hearts %>%
mutate(
male = factor(male), # Convert gender to factor (binary)
age = factor(age), # Convert age to factor (though it’s continuous, we can treat it as a category if needed)
education = factor(education, levels = 1:4, labels = c("Some high school", "High school/GED", "Some college/vocational school", "College")),
currentSmoker = factor(currentSmoker, levels = c(0, 1), labels = c("No", "Yes")),
cigsPerDay = factor(cigsPerDay), # Convert number of cigarettes per day to a factor
BPMeds = factor(BPMeds, levels = c(0, 1), labels = c("No", "Yes")),
prevalentStroke = factor(prevalentStroke, levels = c(0, 1), labels = c("No", "Yes")),
prevalentHyp = factor(prevalentHyp, levels = c(0, 1), labels = c("No", "Yes")),
diabetes = factor(diabetes, levels = c(0, 1), labels = c("No", "Yes")),
totChol = factor(totChol), # Convert cholesterol to factor
sysBP = factor(sysBP), # Convert systolic blood pressure to factor
diaBP = factor(diaBP), # Convert diastolic blood pressure to factor
BMI = factor(BMI), # Convert BMI to factor
heartRate = factor(heartRate), # Convert heart rate to factor
glucose = factor(glucose), # Convert glucose to factor
TenYearCHD = factor(TenYearCHD) # Convert response variable to factor (if it's binary)
)
# Check the structure of the modified dataset
#str(hearts_factorized)
# Check the first few rows to confirm changes
head(hearts_factorized)
male age education currentSmoker cigsPerDay BPMeds
1 1 39 College No 0 No
2 0 46 High school/GED No 0 No
3 1 48 Some high school Yes 20 No
4 0 61 Some college/vocational school Yes 30 No
5 0 46 Some college/vocational school Yes 23 No
6 0 43 High school/GED No 0 No
prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI heartRate
1 No No No 195 106 70 26.97 80
2 No No No 250 121 81 28.73 95
3 No No No 245 127.5 80 25.34 75
4 No Yes No 225 150 95 28.58 65
5 No No No 285 130 84 23.1 85
6 No Yes No 228 180 110 30.3 77
glucose TenYearCHD
1 77 0
2 76 0
3 70 0
4 103 1
5 85 0
6 99 0
The summary of our data, including missing values is below:
summary(hearts_factorized)
male age education currentSmoker
0:2420 40 : 192 Some high school :1720 No :2145
1:1820 46 : 182 High school/GED :1253 Yes:2095
42 : 180 Some college/vocational school: 689
41 : 174 College : 473
48 : 173 NA's : 105
39 : 170
(Other):3169
cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes
0 :2145 No :4063 No :4215 No :2923 No :4131
20 : 734 Yes : 124 Yes: 25 Yes:1317 Yes: 109
30 : 218 NA's: 53
15 : 210
10 : 143
(Other): 761
NA's : 29
totChol sysBP diaBP BMI heartRate
240 : 85 120 : 107 80 : 262 22.19 : 18 75 : 563
220 : 70 130 : 102 82 : 152 22.54 : 18 80 : 385
260 : 62 110 : 96 85 : 137 22.91 : 18 70 : 305
210 : 61 115 : 89 70 : 135 23.48 : 18 60 : 231
232 : 59 125 : 88 81 : 131 23.09 : 16 85 : 228
(Other):3853 124 : 84 84 : 122 (Other):4133 (Other):2527
NA's : 50 (Other):3674 (Other):3301 NA's : 19 NA's : 1
glucose TenYearCHD
75 : 193 0:3596
77 : 167 1: 644
73 : 156
80 : 153
70 : 152
(Other):3031
NA's : 388
From the summary, there are multiple missing observations, so we must use multiple imputation in order to fill the gaps.
In order to impute all the missing values in the data, multiple regression imputation will be done through the MICE library. Multiple Regression-based imputation is a method where missing numerical values are predicted using regression models based on other available features. This imputation method helps maintain relationships between variables and provides more accurate imputations compared to mode imputation for categorical variables and median/mean for numerical variables.
library(mice)
Warning: package 'mice' was built under R version 4.3.3
Warning in check_dep_version(): ABI version mismatch:
lme4 was built with Matrix ABI version 1
Current Matrix ABI version is 0
Please re-install lme4 from source or restore original 'Matrix' package
hearts_clean <- hearts_factorized %>% drop_na()
summary(hearts_clean)
male age education currentSmoker
0:2035 40 : 167 Some high school :1526 No :1869
1:1623 46 : 166 High school/GED :1101 Yes:1789
42 : 161 Some college/vocational school: 608
48 : 149 College : 423
39 : 147
41 : 145
(Other):2723
cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes
0 :1869 No :3547 No :3637 No :2518 No :3559
20 : 651 Yes: 111 Yes: 21 Yes:1140 Yes: 99
30 : 192
15 : 184
10 : 123
5 : 99
(Other): 540
totChol sysBP diaBP BMI heartRate
240 : 69 130 : 90 80 : 217 23.48 : 18 75 : 507
220 : 58 120 : 87 82 : 138 22.54 : 16 80 : 336
260 : 58 110 : 83 85 : 119 22.91 : 15 70 : 269
232 : 54 125 : 79 70 : 114 22.19 : 14 60 : 207
210 : 51 115 : 75 81 : 114 25.09 : 14 85 : 192
230 : 50 124 : 73 78 : 104 23.09 : 13 72 : 184
(Other):3318 (Other):3171 (Other):2852 (Other):3568 (Other):1963
glucose TenYearCHD
75 : 180 0:3101
77 : 166 1: 557
70 : 150
73 : 146
83 : 145
78 : 139
(Other):2732
Now that we have cleaned data, we can move onto looking at the distributions and feature engineering.
For this analysis, examining the distribution of the variables is crucial to ensure the reliability and accuracy of the modeling results. To start we will look at the numerical variables. The continuous variables such as age, cholesterol levels, blood pressure, BMI, and glucose levels need to be assessed for normality, as deviations from normal distribution might suggest the need for data transformation or non-parametric methods.
Visualizations such as histograms for continuous variables and bar plots for categorical variables will help us identify any unusual patterns or skewed distributions, enabling us to apply the necessary preprocessing steps before fitting models like Lasso, Ridge, or Elastic Net regression. This ensures that the analysis is robust and the results are interpretable and reliable.
library(ggplot2)
library(plotly)
# Select only numerical variables
numerical_vars <- hearts[, sapply(hearts, is.numeric)]
# Create a list to store ggplot objects for each variable
plot_list <- list()
# Create a histogram for each numerical variable and store the plots in plot_list
for (var in names(numerical_vars)) {
# Check if the variable is numeric before plotting
if (is.numeric(hearts_clean[[var]])) {
p <- ggplot(hearts_clean, aes_string(x = var)) +
geom_histogram(binwidth = 10, fill = "skyblue", color = "black", alpha = 0.7) +
theme_minimal() +
labs(title = paste("Distribution of", var), x = var, y = "Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
} else {
# If not numeric, use geom_bar (or other appropriate plots)
p <- ggplot(hearts_clean, aes_string(x = var)) +
geom_bar(fill = "skyblue", color = "black", alpha = 0.7) +
theme_minimal() +
labs(title = paste("Distribution of", var), x = var, y = "Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
}
# Convert ggplot to plotly interactive plot
plot_list[[var]] <- ggplotly(p)
}
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
# Create individual interactive panels for each plot
for (p in plot_list) {
print(p)
}
(F………………….)
On the other hand, categorical variables like gender, smoking status, and education level should be checked for class imbalances, as significant imbalances could lead to biased model predictions. We will look at the categorical varibles belwo.
# Load necessary libraries
library(ggplot2)
library(plotly)
# Select only categorical variables (factors or characters)
categorical_vars <- hearts[, sapply(hearts, is.factor) | sapply(hearts, is.character)]
# Create a list to store ggplot objects for each categorical variable
plot_list_cat <- list()
# Create a bar plot for each categorical variable and store the plots in plot_list_cat
for (var in names(categorical_vars)) {
# Check if the variable is categorical
if (is.factor(hearts[[var]]) || is.character(hearts[[var]])) {
p <- ggplot(hearts, aes_string(x = var)) +
geom_bar(fill = "skyblue", color = "black", alpha = 0.7) +
theme_minimal() +
labs(title = paste("Distribution of", var), x = var, y = "Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Convert ggplot to plotly interactive plot
plot_list_cat[[var]] <- ggplotly(p)
}
}
# Create individual interactive panels for each categorical plot
for (p in plot_list_cat) {
print(p)
}
(…………….. )
Now that we have looking the the features individually, we can do some analysis to find insights between the relationships with other features. This insights can provide clarity on what future analysis should focus the attention on. This exploratory analysis will perform three relationship between: 1. two numerical variables 2. two categorical variables 3. one numerical and one categorical variable.
All insights will be explain through visuals and text.
First we will explore the relationship between two numerical variables, glucose levels and age.
# If 'age' or 'glucose' are not numeric, convert them
hearts_clean$age <- as.numeric(hearts_clean$age)
hearts_clean$glucose <- as.numeric(hearts_clean$glucose)
plot1_heart <- ggplot(hearts_clean, aes(x = age, y = glucose)) +
geom_point() +
ylab("Glucose") +
xlab("Age") +
scale_x_continuous(limits = c(0, 100)) +
theme_minimal()
plot1_heart
From this scatterplot, there does not seem to be an obvious relationship
between glucose and age. But, we must investigate further if there is
some interaction with other varibles.
Now looking at two categorical variables we can analyze them visually through a stacked bar plot. Here we will look at the relationship between ten year chd and education
df_percent_hearts <- hearts_clean %>%
group_by(education, TenYearCHD) %>%
summarise(count = n()) %>%
group_by(education) %>%
mutate(percent = count / sum(count) * 100)
# Create the stacked bar plot with percentages
ggplot(df_percent_hearts, aes(x = education, y = percent, fill = TenYearCHD)) +
geom_bar(stat = "identity") +
labs(title = "Stacked Bar Plot of Education by CHD",
x = "Education Level",
y = "Percentage") +
scale_fill_manual(values = c("red", "blue")) + # Adjust colors as needed
theme_minimal() +
theme(legend.position = "bottom")
From the stacked bar chart, once again we can see that there is not a significant relationship between education and and having CHD.
Finally we will explore the relationship between one numerical variable, heartRate, with a categorical feature, TenYearCHD. This will be explored using boxplots.
ggplot(hearts, aes(x = education, y = heartRate)) +
geom_boxplot() +
labs(title = "Box Plot of Education and HeartRate",
x = "HeartRate",
y = "Education") +
theme_minimal()
Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?
Warning: Removed 105 rows containing missing values (`stat_boxplot()`).
Warning: Removed 1 rows containing non-finite values (`stat_boxplot()`).
Here we can see that are definitely some outliers with the data, but that could be due to other factors and must be explored further.
It was shown that there were some positively skewed distributions so these should be transformed using a log transformation. The code for these transformations and their resulting graphs are shown below.
# Convert relevant columns to numeric if they are factors
hearts_clean$age <- as.numeric(as.character(hearts_clean$age))
hearts_clean$totChol <- as.numeric(as.character(hearts_clean$totChol))
hearts_clean$BMI <- as.numeric(as.character(hearts_clean$BMI))
hearts_clean$glucose <- as.numeric(as.character(hearts_clean$glucose))
hearts_clean$heartRate <- as.numeric(as.character(hearts_clean$heartRate))
hearts_clean$diabetes <- as.numeric(as.character(hearts_clean$diabetes))
Warning: NAs introduced by coercion
# Apply transformations after ensuring the variables are numeric
hearts_clean <- hearts_clean %>%
mutate(
age_log = log1p(age), # log transformation of age
totChol_log = log1p(totChol), # log transformation of totChol
BMI_sqrt = sqrt(BMI), # square root transformation of BMI
glucose_sqrt = sqrt(glucose), # square root transformation of glucose
heartRate_inv = 1 / (heartRate + 1), # inverse transformation of heartRate
diabetes_inv = 1 / (diabetes + 1) # inverse transformation of diabetes
)
Above we have identified categorical variables that should be regrouped to simplify our model. Regrouping simplifies our model, which improves our power.
These can be shown being regrouped below:
# Regrouping Age into categories
hearts_filtered <- hearts_clean %>%
mutate(age_group = case_when(
age <= 30 ~ "Under 30",
age <= 50 ~ "30-50",
age <= 70 ~ "51-70",
age > 70 ~ "Above 70"
))
# Regrouping Education into categories (assuming you have education data coded with numbers or specific categories)
hearts <- hearts_filtered %>%
mutate(education_group = case_when(
education == 1 ~ "Low", # If "1" means low education level
education == 2 ~ "Medium", # If "2" means medium education level
education == 3 ~ "High" # If "3" means high education level
))
# Regrouping BP Medications (BPMeds) into categories
hearts <- hearts_filtered %>%
mutate(BPMeds_group = case_when(
BPMeds == 0 ~ "No Medication",
BPMeds == 1 ~ "On Medication"
))
# Check the results
head(hearts)
male age education currentSmoker cigsPerDay BPMeds
1 1 8 College No 0 No
2 0 15 High school/GED No 0 No
3 1 17 Some high school Yes 20 No
4 0 30 Some college/vocational school Yes 30 No
5 0 15 Some college/vocational school Yes 23 No
6 0 12 High school/GED No 0 No
prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI heartRate
1 No No NA 195 106 70 26.97 80
2 No No NA 250 121 81 28.73 95
3 No No NA 245 127.5 80 25.34 75
4 No Yes NA 225 150 95 28.58 65
5 No No NA 285 130 84 23.10 85
6 No Yes NA 228 180 110 30.30 77
glucose TenYearCHD age_log totChol_log BMI_sqrt glucose_sqrt heartRate_inv
1 33 0 2.197225 5.278115 5.193265 5.744563 0.01234568
2 32 0 2.772589 5.525453 5.360037 5.656854 0.01041667
3 26 0 2.890372 5.505332 5.033885 5.099020 0.01315789
4 59 1 3.433987 5.420535 5.346027 7.681146 0.01515152
5 41 0 2.772589 5.655992 4.806246 6.403124 0.01162791
6 55 0 2.564949 5.433722 5.504544 7.416198 0.01282051
diabetes_inv age_group BPMeds_group
1 NA Under 30 <NA>
2 NA Under 30 <NA>
3 NA Under 30 <NA>
4 NA Under 30 <NA>
5 NA Under 30 <NA>
6 NA Under 30 <NA>
Now let’s move in discrectizing some of our numerical variables. This is useful for multiple reasons, including improving model interpretability, handling non-linearity, and reducing noisy data. This os grouping continuous data. Some of the variables we are going to discretize are age, income and interest rates.
Below is the code in order to discretize data
hearts <- hearts_clean %>%
# Discretize Age
mutate(age_group = case_when(
age <= 30 ~ "Young",
age > 30 & age <= 50 ~ "Middle-Aged",
age > 50 ~ "Older",
TRUE ~ NA_character_ # Handle unexpected values
)) %>%
# Discretize BMI
mutate(bmi_group = case_when(
BMI < 18.5 ~ "Underweight",
BMI >= 18.5 & BMI < 25 ~ "Normal Weight",
BMI >= 25 & BMI < 30 ~ "Overweight",
BMI >= 30 ~ "Obese",
TRUE ~ NA_character_
)) %>%
# Discretize Glucose levels
mutate(glucose_group = case_when(
glucose < 100 ~ "Normal",
glucose >= 100 & glucose < 126 ~ "Pre-diabetic",
glucose >= 126 ~ "Diabetic",
TRUE ~ NA_character_
)) %>%
# Discretize Cholesterol levels (TotChol)
mutate(cholesterol_group = case_when(
totChol < 200 ~ "Desirable",
totChol >= 200 & totChol < 240 ~ "Borderline High",
totChol >= 240 ~ "High",
TRUE ~ NA_character_
))
# Convert new columns to factors
# Check the first few rows
head(hearts)
male age education currentSmoker cigsPerDay BPMeds
1 1 8 College No 0 No
2 0 15 High school/GED No 0 No
3 1 17 Some high school Yes 20 No
4 0 30 Some college/vocational school Yes 30 No
5 0 15 Some college/vocational school Yes 23 No
6 0 12 High school/GED No 0 No
prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI heartRate
1 No No NA 195 106 70 26.97 80
2 No No NA 250 121 81 28.73 95
3 No No NA 245 127.5 80 25.34 75
4 No Yes NA 225 150 95 28.58 65
5 No No NA 285 130 84 23.10 85
6 No Yes NA 228 180 110 30.30 77
glucose TenYearCHD age_log totChol_log BMI_sqrt glucose_sqrt heartRate_inv
1 33 0 2.197225 5.278115 5.193265 5.744563 0.01234568
2 32 0 2.772589 5.525453 5.360037 5.656854 0.01041667
3 26 0 2.890372 5.505332 5.033885 5.099020 0.01315789
4 59 1 3.433987 5.420535 5.346027 7.681146 0.01515152
5 41 0 2.772589 5.655992 4.806246 6.403124 0.01162791
6 55 0 2.564949 5.433722 5.504544 7.416198 0.01282051
diabetes_inv age_group bmi_group glucose_group cholesterol_group
1 NA Young Overweight Normal Desirable
2 NA Young Overweight Normal High
3 NA Young Overweight Normal High
4 NA Young Overweight Normal Borderline High
5 NA Young Normal Weight Normal High
6 NA Young Obese Normal Borderline High
Finally, we want to filter all our data into one cleaned dataset. We only want our discretized variables and those domain variables we discussed earlier. Below the code to filter the final dataset is seen below. This data is now able to be used for future analysis.
library(dplyr)
#colnames(hearts)
# Filter the dataset to include only the specified variables
hearts_final <- hearts %>%
select(male, age_log, education, currentSmoker, cigsPerDay, BPMeds,
prevalentStroke, prevalentHyp, totChol_log, BMI_sqrt, glucose, heartRate_inv, TenYearCHD)
# View the filtered dataset
head(hearts_final)
male age_log education currentSmoker cigsPerDay BPMeds
1 1 2.197225 College No 0 No
2 0 2.772589 High school/GED No 0 No
3 1 2.890372 Some high school Yes 20 No
4 0 3.433987 Some college/vocational school Yes 30 No
5 0 2.772589 Some college/vocational school Yes 23 No
6 0 2.564949 High school/GED No 0 No
prevalentStroke prevalentHyp totChol_log BMI_sqrt glucose heartRate_inv
1 No No 5.278115 5.193265 33 0.01234568
2 No No 5.525453 5.360037 32 0.01041667
3 No No 5.505332 5.033885 26 0.01315789
4 No Yes 5.420535 5.346027 59 0.01515152
5 No No 5.655992 4.806246 41 0.01162791
6 No Yes 5.433722 5.504544 55 0.01282051
TenYearCHD
1 0
2 0
3 0
4 1
5 0
6 0
Lasso (Least Absolute Shrinkage and Selection Operator) regression applies L1 regularization, encouraging sparsity in the model by driving some coefficients to zero, effectively performing feature selection. Ridge regression, on the other hand, uses L2 regularization, which penalizes large coefficients but does not eliminate variables. Elastic Net combines both L1 and L2 regularization, balancing the strengths of Lasso and Ridge. By comparing these three approaches, we can identify the most effective model for predicting glucose levels, while also improving model interpretability and generalizability.
Coefficient path analysis is a powerful technique used to visualize and interpret how the coefficients of a regression model change as a regularization parameter (such as lambda in Lasso, Ridge, or Elastic Net regression) is varied. This method provides insight into the stability of each predictor variable’s contribution to the model, helping to identify which features are most influential in predicting the target variable.
library(glmnet)
Warning: package 'glmnet' was built under R version 4.3.3
library(caret)
# Prepare the data
# Remove rows with missing target variable 'glucose'
hearts_final <- hearts_final[!is.na(hearts_final$glucose), ]
# Redefine X (predictors) and y (target)
X <- hearts_final[, -which(names(hearts_final) == "glucose")] # All predictors except 'glucose'
y <- hearts_final$glucose # Target variable 'glucose'
# Split the data into training and testing sets (80% train, 20% test)
set.seed(1234) # Set seed for reproducibility
train_index <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[train_index, ]
X_test <- X[-train_index, ]
y_train <- y[train_index]
y_test <- y[-train_index]
# Ensure categorical variables are converted to numeric (dummy/indicator variables)
# Convert X_train and X_test to numeric matrix with model.matrix
X_train <- model.matrix(~ . - 1, data = as.data.frame(X_train)) # Remove intercept (-1)
X_test <- model.matrix(~ . - 1, data = as.data.frame(X_test)) # Remove intercept (-1)
# Align columns of X_train and X_test to have the same variables
# This ensures that both training and testing datasets have the same set of columns
X_test <- X_test[, colnames(X_train), drop = FALSE]
# Fit Lasso (alpha = 1), Ridge (alpha = 0), and Elastic Net (alpha = 0.5)
fit_lasso <- glmnet(X_train, y_train, alpha = 1)
fit_ridge <- glmnet(X_train, y_train, alpha = 0)
fit_elastic_net <- glmnet(X_train, y_train, alpha = 0.5)
# Cross-validation for Lasso, Ridge, and Elastic Net
cv_lasso <- cv.glmnet(X_train, y_train, alpha = 1)
cv_ridge <- cv.glmnet(X_train, y_train, alpha = 0)
cv_elastic_net <- cv.glmnet(X_train, y_train, alpha = 0.5)
# Plot cross-validation results
par(mfrow = c(1, 3)) # Arrange plots in a row
plot(cv_lasso)
plot(cv_ridge)
plot(cv_elastic_net)
# Best lambda for each model
lambda_lasso <- cv_lasso$lambda.min
lambda_ridge <- cv_ridge$lambda.min
lambda_elastic_net <- cv_elastic_net$lambda.min
# Predict using the best lambda from each model
pred_lasso <- predict(fit_lasso, X_test, s = lambda_lasso, type = "response")
pred_ridge <- predict(fit_ridge, X_test, s = lambda_ridge, type = "response")
pred_elastic_net <- predict(fit_elastic_net, X_test, s = lambda_elastic_net, type = "response")
We can see that there are some insignificant predictor variables, and they should be dropped from the model. Using the step() function, we will now find the final model. The final best model will be a model that is between the full and reduced models.
library(glmnet)
# Plot the coefficient path
par(mar=c(5,4,6,3)) # Adjust margins to fit the title
plot(fit_lasso, xvar = "lambda", label = TRUE,
lwd = 1.5,
main = "Coefficient Path Analysis: LASSO (Hearts Dataset)",
cex.main = 0.9,
col = rainbow(ncol(X))) # Color for each coefficient
abline(v = 1, col = "purple", lty = 4, lwd = 2) # Vertical line for lambda = 1
abline(v = -1, col = "steelblue", lty = 2, lwd = 2) # Vertical line for lambda = -1
par(mar=c(5,4,6,3))
##
plot(cv_lasso, main = "RMSE Plot: LASSO",
cex.main = 0.9)
# Calculate RMSE for each model
rmse_lasso <- sqrt(mean((pred_lasso - y_test)^2))
rmse_ridge <- sqrt(mean((pred_ridge - y_test)^2))
rmse_elastic_net <- sqrt(mean((pred_elastic_net - y_test)^2))
cat("RMSE for Lasso: ", rmse_lasso, "\n")
RMSE for Lasso: 15.8167
cat("RMSE for Ridge: ", rmse_ridge, "\n")
RMSE for Ridge: 15.87287
cat("RMSE for Elastic Net: ", rmse_elastic_net, "\n")
RMSE for Elastic Net: 15.81683
Tuning the regularization parameter, lambda, is a crucial step in the process of fitting models these models. This parameter controls the strength of the penalty applied to the model, helping to avoid overfitting and improving model generalization. A large lambda value leads to greater regularization, resulting in smaller coefficients, while a smaller lambda value allows the model to fit the training data more closely, potentially leading to overfitting.
To identify the best lambda for each model, we use cross-validation (via the cv.glmnet function), which evaluates the model performance across different values of lambda. The process involves splitting the data into training and validation sets multiple times and calculating the prediction error for each candidate lambda. The optimal lambda is the one that minimizes the cross-validation error, ensuring that the model generalizes well to unseen data.
library(glmnet)
library(caret)
library(pander)
# Define the features (predictors) and target
X <- as.matrix(hearts_final[, -which(names(hearts_final) == "glucose")]) # All features except target 'glucose'
y <- hearts_final$glucose # Target variable 'glucose'
# Split the data into training and testing sets
set.seed(123) # Set seed for reproducibility
train_index <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[train_index, ]
X_test <- X[-train_index, ]
y_train <- y[train_index]
y_test <- y[-train_index]
# Cross-validation to find the best lambda for each model
cv_lasso <- cv.glmnet(X_train, y_train, alpha = 1)
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
cv_ridge <- cv.glmnet(X_train, y_train, alpha = 0)
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
cv_elastic_net <- cv.glmnet(X_train, y_train, alpha = 0.5)
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
# Extract the best lambda values for each model
best.lasso.lambda <- cv_lasso$lambda.min
best.ridge.lambda <- cv_ridge$lambda.min
best.elastic.net.lambda <- cv_elastic_net$lambda.min
# Lasso Regression (L1 Regularization)
lasso_model.opt <- glmnet(X_train, y_train, alpha = 1, lambda = best.lasso.lambda)
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
lasso_predictions.opt <- predict(lasso_model.opt, s = best.lasso.lambda, newx = X_test)
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
lasso_rmse.opt <- sqrt(mean((y_test - lasso_predictions.opt)^2))
# Ridge Regression (L2 Regularization)
ridge_model.opt <- glmnet(X_train, y_train, alpha = 0, lambda = best.ridge.lambda)
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
ridge_predictions.opt <- predict(ridge_model.opt, s = best.ridge.lambda, newx = X_test)
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
ridge_rmse.opt <- sqrt(mean((y_test - ridge_predictions.opt)^2))
# Elastic Net (Combination of L1 and L2)
elastic_net_model.opt <- glmnet(X_train, y_train, alpha = 0.5, lambda = best.elastic.net.lambda)
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
elastic_net_predictions.opt <- predict(elastic_net_model.opt, s = best.elastic.net.lambda, newx = X_test)
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
elastic_net_rmse.opt <- sqrt(mean((y_test - elastic_net_predictions.opt)^2))
# Combine RMSE values for comparison
RMSE.opt = cbind(LASSO.opt = lasso_rmse.opt,
Ridge.opt = ridge_rmse.opt,
Elasticnet.opt = elastic_net_rmse.opt)
# Display the results using pander
pander(RMSE.opt)
| LASSO.opt | Ridge.opt | Elasticnet.opt |
|---|---|---|
| 15.77 | 15.76 | 15.77 |
The resulting LASSO regression equation is given by
# Extract coefficients for the best lambda
best_lambda.lasso <- cv_lasso$lambda.min
coefficients.lasso <- coef(cv_lasso, s = best_lambda.lasso)
# Extract the intercept and betas
intercept.lasso <- coefficients.lasso[1]
betas.lasso <- coefficients.lasso[-1]
# Reconstruct the model equation as a string
model_equation <- paste("Model equation: y =", round(intercept.lasso, 4),
"+", paste(round(betas.lasso, 4),
colnames(X),
sep = "*", collapse = " + "), "\n")
# Print the model equation
cat(model_equation)
Model equation: y = 34.5355 + 1.0482*male + 3.1015*age_log + 0*education + 0*currentSmoker + -0.0978*cigsPerDay + 0*BPMeds + 0*prevalentStroke + 0*prevalentHyp + -1.6997*totChol_log + 2.9522*BMI_sqrt + -923.3899*heartRate_inv + 3.6417*TenYearCHD
The resulting Ridge regression equation is given by
# Extract coefficients for the best lambda
best_lambda.ridge <- cv_ridge$lambda.min
coefficients.ridge <- coef(cv_ridge, s = best_lambda.ridge)
# Extract the intercept and betas
intercept.ridge <- coefficients.ridge[1]
betas.ridge <- coefficients.ridge[-1]
# Reconstruct the model equation as a string
model_equation_ridge <- paste("Model equation: y =", round(intercept.ridge, 4),
"+", paste(round(betas.ridge, 4),
colnames(X),
sep = "*", collapse = " + "), "\n")
# Print the model equation
cat(model_equation_ridge)
Model equation: y = 33.1105 + 0.9322*male + 2.9493*age_log + 0*education + 0*currentSmoker + -0.0908*cigsPerDay + 0*BPMeds + 0*prevalentStroke + 0*prevalentHyp + -1.416*totChol_log + 2.852*BMI_sqrt + -860.895*heartRate_inv + 3.4759*TenYearCHD
The resulting ElasticNet regression equation is given by
## Elastic Net
# Extract coefficients for the best lambda
best_lambda.net <- cv_elastic_net$lambda.min
coefficients.net <- coef(cv_elastic_net, s = best_lambda.net)
# Extract the intercept and betas
intercept.net <- coefficients.net[1]
betas.net <- coefficients.net[-1]
# Reconstruct the model equation as a string
model_equation_net <- paste("Model equation: y =", round(intercept.net, 4),
"+", paste(round(betas.net, 4),
colnames(X),
sep = "*", collapse = " + "), "\n")
# Print the model equation
cat(model_equation_net)
Model equation: y = 34.5217 + 1.0471*male + 3.1003*age_log + 0*education + 0*currentSmoker + -0.0978*cigsPerDay + 0*BPMeds + 0*prevalentStroke + 0*prevalentHyp + -1.697*totChol_log + 2.9514*BMI_sqrt + -922.8733*heartRate_inv + 3.6403*TenYearCHD
Based on the RMSE (Root Mean Squared Error) values from the cross-validation results, the model with the lowest RMSE value indicates the best predictive performance for the glucose variable in the hearts_final dataset. In this case, with the RMSE of 22.21, the best model to use the Ridge Regression.
Regularized logistic regression is a powerful statistical technique used for binary classification tasks, where the goal is to predict the probability of a binary outcome (Having coronary heart Disease). Unlike traditional logistic regression, regularized logistic regression incorporates penalty terms (L1 or L2 regularization) to control the complexity of the model, preventing overfitting and improving generalization to unseen data.
Below is the code for the full logistical model
library(glmnet)
library(caret)
library(pander)
# Define the features (predictors) and target
X <- as.matrix(hearts_final[, -which(names(hearts_final) == "glucose")])
# Predictors
y <- hearts_final$TenYearCHD # Binary target variable (TenYearCHD)
# Ensure target variable is binary (0 or 1)
y <- ifelse(y == 1, 1, 0)
# Split the data into training and testing sets
set.seed(123)
trainIndex <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[trainIndex, ]
X_test <- X[-trainIndex, ]
y_train <- y[trainIndex]
y_test <- y[-trainIndex]
####################
# Fit LASSO model (L1 Regularization)
####################
lasso_model <- glmnet(X_train, y_train, family = "binomial", alpha = 1)
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
# Cross-validation to find the optimal lambda for Lasso
cv_lasso <- cv.glmnet(X_train, y_train, family = "binomial", alpha = 1)
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
lambda_lasso <- cv_lasso$lambda.min
print(lambda_lasso)
[1] 0.0005786163
# Refit the model with the optimal lambda
lasso_model_opt <- glmnet(X_train, y_train, family = "binomial", alpha = 1, lambda = lambda_lasso)
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
# Make predictions on the test set
lasso_predictions <- predict(lasso_model_opt, s = lambda_lasso, newx = X_test, type = "response")
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
# Convert predicted probabilities to binary predictions (threshold 0.5)
lasso_binary_predictions <- ifelse(lasso_predictions > 0.5, 1, 0)
# Evaluate the model using confusion matrix
confusion_matrix_lasso <- confusionMatrix(factor(lasso_binary_predictions), factor(y_test))
print(confusion_matrix_lasso)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 607 0
1 0 124
Accuracy : 1
95% CI : (0.995, 1)
No Information Rate : 0.8304
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.8304
Detection Rate : 0.8304
Detection Prevalence : 0.8304
Balanced Accuracy : 1.0000
'Positive' Class : 0
# Print Lasso model coefficients
coef(lasso_model_opt, s = lambda_lasso)
13 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) -8.329762
male .
age_log .
education .
currentSmoker .
cigsPerDay .
BPMeds .
prevalentStroke .
prevalentHyp .
totChol_log .
BMI_sqrt .
heartRate_inv .
TenYearCHD 14.907471
####################
# Fit Ridge model (L2 Regularization)
####################
ridge_model <- glmnet(X_train, y_train, family = "binomial", alpha = 0)
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
# Cross-validation to find the optimal lambda for Ridge
cv_ridge <- cv.glmnet(X_train, y_train, family = "binomial", alpha = 0)
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
lambda_ridge <- cv_ridge$lambda.min
print(lambda_ridge)
[1] 0.03550336
# Refit the model with the optimal lambda
ridge_model_opt <- glmnet(X_train, y_train, family = "binomial", alpha = 0, lambda = lambda_ridge)
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
# Make predictions on the test set
ridge_predictions <- predict(ridge_model_opt, s = lambda_ridge, newx = X_test, type = "response")
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
# Convert predicted probabilities to binary predictions (threshold 0.5)
ridge_binary_predictions <- ifelse(ridge_predictions > 0.5, 1, 0)
# Evaluate the model using confusion matrix
confusion_matrix_ridge <- confusionMatrix(factor(ridge_binary_predictions), factor(y_test))
print(confusion_matrix_ridge)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 607 0
1 0 124
Accuracy : 1
95% CI : (0.995, 1)
No Information Rate : 0.8304
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.8304
Detection Rate : 0.8304
Detection Prevalence : 0.8304
Balanced Accuracy : 1.0000
'Positive' Class : 0
# Print Ridge model coefficients
coef(ridge_model_opt, s = lambda_ridge)
13 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) -7.385601890
male 0.144641931
age_log 0.418075817
education .
currentSmoker .
cigsPerDay 0.003701865
BPMeds .
prevalentStroke .
prevalentHyp .
totChol_log 0.337881740
BMI_sqrt 0.146411010
heartRate_inv -5.888671645
TenYearCHD 5.137098262
####################
# Fit Elastic Net model (L1 and L2 Regularization)
####################
elastic_model <- glmnet(X_train, y_train, family = "binomial", alpha = 0.5)
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
# Cross-validation to find the optimal lambda for Elastic Net
cv_elastic <- cv.glmnet(X_train, y_train, family = "binomial", alpha = 0.5)
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
lambda_elastic <- cv_elastic$lambda.min
print(lambda_elastic)
[1] 0.0001800279
# Refit the model with the optimal lambda
elastic_model_opt <- glmnet(X_train, y_train, family = "binomial", alpha = 0.5, lambda = lambda_elastic)
Warning in storage.mode(xd) <- "double": NAs introduced by coercion
# Make predictions on the test set
elastic_predictions <- predict(elastic_model_opt, s = lambda_elastic, newx = X_test, type = "response")
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
# Convert predicted probabilities to binary predictions (threshold 0.5)
elastic_binary_predictions <- ifelse(elastic_predictions > 0.5, 1, 0)
# Evaluate the model using confusion matrix
confusion_matrix_elastic <- confusionMatrix(factor(elastic_binary_predictions), factor(y_test))
print(confusion_matrix_elastic)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 607 0
1 0 124
Accuracy : 1
95% CI : (0.995, 1)
No Information Rate : 0.8304
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.8304
Detection Rate : 0.8304
Detection Prevalence : 0.8304
Balanced Accuracy : 1.0000
'Positive' Class : 0
# Print Elastic Net model coefficients
coef(elastic_model_opt, s = lambda_elastic)
13 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) -8.8573526
male .
age_log 0.1806627
education .
currentSmoker .
cigsPerDay .
BPMeds .
prevalentStroke .
prevalentHyp .
totChol_log .
BMI_sqrt .
heartRate_inv .
TenYearCHD 14.8884964
In binary classification, the model outputs probabilities rather than direct class labels. These probabilities represent the likelihood of the event (in this case, the occurrence of heart disease in 10 years). Typically, a threshold (cutoff) is applied to convert these continuous probabilities into binary outcomes (e.g., 0 or 1).
The optimal cutoff probability is determined by maximizing the model’s performance based on the specific goals of the analysis. By adjusting the cutoff threshold, we can find a balance that maximizes model performance.
#############################
# Predict on the test set: type = "response" gives probabilities
predict_lasso <- predict(lasso_model_opt, newx = X_test, type = "response")
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
predict_ridge <- predict(ridge_model_opt, newx = X_test, type = "response")
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
predict_elastic <- predict(elastic_model_opt, newx = X_test, type = "response")
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
###########################################
## Optimal cutoff probability determination
seq.cut <- seq(0, 1, length = 50) # Sequence of cutoff values
acc.lasso <- NULL
acc.ridge <- NULL
acc.elastic <- NULL
for (i in 1:length(seq.cut)) {
# Convert probabilities to binary predictions based on the cutoff value
predy.lasso <- ifelse(predict_lasso > seq.cut[i], 1, 0)
predy.ridge <- ifelse(predict_ridge > seq.cut[i], 1, 0)
predy.elastic <- ifelse(predict_elastic > seq.cut[i], 1, 0)
# Calculate accuracy for each model
acc.lasso[i] <- mean(y_test == predy.lasso)
acc.ridge[i] <- mean(y_test == predy.ridge)
acc.elastic[i] <- mean(y_test == predy.elastic)
}
## Optimal cut-off: average cutoff if multiple cutoffs give max accuracy
opt.cut.lasso <- mean(seq.cut[which(acc.lasso == max(acc.lasso))])
opt.cut.ridge <- mean(seq.cut[which(acc.ridge == max(acc.ridge))])
opt.cut.elastic <- mean(seq.cut[which(acc.elastic == max(acc.elastic))])
## Data frame to store accuracies and cutoff probabilities
acc.data <- data.frame(
prob = rep(seq.cut, 3),
acc = c(acc.lasso, acc.ridge, acc.elastic),
group = c(rep("lasso", 50), rep("ridge", 50), rep("elastic", 50))
)
# Plot accuracy vs. cutoff for each model
library(ggplot2)
ggplot(acc.data, aes(x = prob, y = acc, color = group)) +
geom_line() +
labs(title = "Accuracy vs. Cutoff for LASSO, Ridge, and Elastic Net",
x = "Cutoff Probability", y = "Accuracy") +
theme_minimal()
# Print the optimal cutoff probabilities
cat("Optimal cutoff for LASSO: ", opt.cut.lasso, "\n")
Optimal cutoff for LASSO: 0.5
cat("Optimal cutoff for Ridge: ", opt.cut.ridge, "\n")
Optimal cutoff for Ridge: 0.3673469
cat("Optimal cutoff for Elastic Net: ", opt.cut.elastic, "\n")
Optimal cutoff for Elastic Net: 0.5
Here we can see that each model has a different optimal cutoff.
gg.acc <- ggplot(data = acc.data, aes(x=prob, y = acc, color = group)) +
geom_line() +
annotate("text", x = 0.6, y = 0.45,
label = paste("LASSO cutoff: ", round(opt.cut.lasso,5), "Accuracy: ", round(max(acc.lasso),5),
"\nRidge cutoff: ", round(opt.cut.ridge,5), "Accuracy: ", round(max(acc.ridge),5),
"\nElastic cutoff: ", round(opt.cut.elastic,5), "Accuracy: ", round(max(acc.elastic),5)),
size = 3,
color = "navy") +
ggtitle("Cut-off Probability vs Accuracy") +
labs(x = "cut-off Probability",
y = "accuracy", color = "Group") +
theme(plot.title = element_text(hjust = 0.5))
##
ggplotly(gg.acc)
Now using these cutoff we can predict and assign labels.
#######################################
## using the optimal cutoff probability to predict labels
##
pred.lab.lasso <- ifelse(predict_lasso >opt.cut.lasso, 1, 0)
pred.lab.ridge<- ifelse(predict_ridge >opt.cut.ridge, 1, 0)
pred.lab.elastic<- ifelse(predict_elastic >opt.cut.elastic, 1, 0)
#################################
# Convert predictions to factors
pred.lab.lasso.fct <- as.factor(pred.lab.lasso)
pred.lab.ridge.fct <- as.factor(pred.lab.ridge)
pred.lab.elastic.fct <- as.factor(pred.lab.elastic)
# Convert actual values to factors
y_test <- as.factor(y_test)
# Confusion Matrix and Metrics
confusion.lasso <- confusionMatrix(pred.lab.lasso.fct, y_test)
confusion.ridge<- confusionMatrix(pred.lab.ridge.fct, y_test)
confusion.elastic <- confusionMatrix(pred.lab.elastic.fct, y_test)
## Commonly used performance measured
PerfMeasures <- cbind(lasso = confusion.lasso$byClass,
ridge = confusion.ridge$byClass,
elastic = confusion.elastic$byClass)
pander(PerfMeasures)
| lasso | ridge | elastic | |
|---|---|---|---|
| Sensitivity | 1 | 1 | 1 |
| Specificity | 1 | 1 | 1 |
| Pos Pred Value | 1 | 1 | 1 |
| Neg Pred Value | 1 | 1 | 1 |
| Precision | 1 | 1 | 1 |
| Recall | 1 | 1 | 1 |
| F1 | 1 | 1 | 1 |
| Prevalence | 0.8304 | 0.8304 | 0.8304 |
| Detection Rate | 0.8304 | 0.8304 | 0.8304 |
| Detection Prevalence | 0.8304 | 0.8304 | 0.8304 |
| Balanced Accuracy | 1 | 1 | 1 |
The ROC curve is a graphical representation that illustrates the trade-off between sensitivity (true positive rate) and 1-specificity (false positive rate) across different threshold values. A model that performs well will have a ROC curve that rises sharply towards the top-left corner, indicating high sensitivity and low false positive rate. The Area Under the Curve (AUC) is another important metric derived from the ROC analysis, which quantifies the overall ability of the model to discriminate between the positive and negative classes. An AUC value closer to 1 indicates excellent model performance, while a value closer to 0.5 suggests the model has no discriminatory power.
library(pROC)
# Predicted probabilities for each model
prob_lasso <- predict(lasso_model_opt, newx = X_test, type = "response")
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
prob_ridge <- predict(ridge_model_opt, newx = X_test, type = "response")
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
prob_elastic <- predict(elastic_model_opt, newx = X_test, type = "response")
Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
# Compute ROC curves
roc_lasso <- roc(y_test, prob_lasso)
Warning in roc.default(y_test, prob_lasso): Deprecated use a matrix as
predictor. Unexpected results may be produced, please pass a numeric vector.
roc_ridge <- roc(y_test, prob_ridge)
Warning in roc.default(y_test, prob_ridge): Deprecated use a matrix as
predictor. Unexpected results may be produced, please pass a numeric vector.
roc_elastic <- roc(y_test, prob_elastic)
Warning in roc.default(y_test, prob_elastic): Deprecated use a matrix as
predictor. Unexpected results may be produced, please pass a numeric vector.
# Compute AUC values
auc_lasso <- auc(roc_lasso)
auc_ridge <- auc(roc_ridge)
auc_elastic <- auc(roc_elastic)
## Extract sensitivity and specificity values
sen.lasso <- roc_lasso$sensitivities
spe.lasso <- roc_lasso$specificities
sen.ridge <- roc_ridge$sensitivities
spe.ridge <- roc_ridge$specificities
sen.elastic <- roc_elastic$sensitivities
spe.elastic <- roc_elastic$specificities
# Plot the ROC curves
plot(1 - spe.lasso, sen.lasso,
type = "l", col = "green",
xlim = c(0,1),
xlab = "1 - Specificity",
ylab = "Sensitivity",
main = "ROC Curves for LASSO, Ridge, and Elastic Net")
lines(1 - spe.ridge, sen.ridge, col = "orange")
lines(1 - spe.elastic, sen.elastic, col = "purple")
abline(0, 1, type = "l", lty = 2, col = "steelblue", lwd = 1) # Diagonal line
Warning in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...): graphical
parameter "type" is obsolete
# Add legend
legend("bottomright", legend = c(paste("LASSO (AUC =", round(auc_lasso, 3), ")"),
paste("Ridge (AUC =", round(auc_ridge, 3), ")"),
paste("Elastic Net (AUC =", round(auc_elastic, 3), ")")),
col = c("green", "orange", "purple"), lty = 1, cex = 0.8, bty = "n")
5-fold CV performance plot
colnames(hearts_final)
[1] "male" "age_log" "education" "currentSmoker"
[5] "cigsPerDay" "BPMeds" "prevalentStroke" "prevalentHyp"
[9] "totChol_log" "BMI_sqrt" "glucose" "heartRate_inv"
[13] "TenYearCHD"
The above figure indicates that the optimal cut-off probability that yields the best accuracy is 0.48.
We will fit Support Vector Regression (SVR) models with both linear and radial basis function (RBF) kernels, as well as an ordinary least squares (OLS) regression model (with step-wise variable selection). The performance of these three regression models will be evaluated using mean squared error (MSE) and mean absolute error (MAE). This will be used to predict glucose levels in patients.
# Load required libraries
library(e1071)
# Assuming hearts_final is your dataset
# Check for missing values and remove rows with NA values
X <- hearts_final[, -which(names(hearts_final) == "glucose")] # All features except 'glucose'
y <- hearts_final$glucose # Target variable 'glucose'
# Remove rows with NA values from both predictors (X) and target (y)
complete_data <- complete.cases(X, y)
X <- X[complete_data, ]
y <- y[complete_data]
# Ensure all predictor columns are numeric
X[] <- lapply(X, as.numeric) # Convert all columns in X to numeric
# Scale the predictors (optional but helps in SVMs)
X <- scale(X)
# Split the data into training and test sets (80-20 split)
set.seed(123) # For reproducibility
train.index <- sample(1:nrow(X), 0.8 * nrow(X))
X.train <- X[train.index, ]
y.train <- y[train.index]
X.test <- X[-train.index, ]
y.test <- y[-train.index]
# Set up the grid for hyperparameters (RBF kernel)
tune.grid <- expand.grid(
epsilon = seq(0.1, 0.5, 0.1),
cost = c(1, 10, 100),
gamma = c(0.01, 0.1, 1)
)
# Set up cross-validation control (5-fold cross-validation)
tune.control <- tune.control(
cross = 5, # 5-fold cross-validation
nrepeat = 1 # Number of repetitions
)
# Perform grid search for hyperparameter tuning: RBF kernel
tune.RBF <- tune(
svm,
train.x = X.train,
train.y = y.train,
ranges = list(epsilon = seq(0.1, 0.5, 0.1),
cost = c(1, 10, 100),
gamma = c(0.01, 0.1, 1)), # Hyperparameters for RBF kernel
tunecontrol = tune.control(sampling = "cross", cross = 5) # 5-fold cross-validation
)
# Check the best parameters found
print(tune.RBF$best.parameters)
epsilon cost gamma
5 0.5 1 0.01
# Train the final model using the best parameters for RBF kernel
final.RBF <- svm(
X.train, y.train,
type = "eps-regression", # "eps-regression" for SVR
kernel = "radial",
epsilon = tune.RBF$best.parameters$epsilon,
cost = tune.RBF$best.parameters$cost,
gamma = tune.RBF$best.parameters$gamma
)
# Make predictions on the test set
pred.RBF <- predict(final.RBF, X.test)
# Evaluate performance (Mean Squared Error and Mean Absolute Error)
mse.RBF <- mean((y.test - pred.RBF)^2) # Mean squared error
mae.RBF <- mean(abs(y.test - pred.RBF)) # Mean absolute error
# Print the performance metrics
print(paste("MSE for RBF Kernel:", mse.RBF))
[1] "MSE for RBF Kernel: 245.408530793048"
print(paste("MAE for RBF Kernel:", mae.RBF))
[1] "MAE for RBF Kernel: 10.6550225304335"
# Optionally, you can perform similar steps for the linear kernel (if needed)
# Perform grid search for hyperparameter tuning: Linear kernel
tune.lin <- tune(
svm,
train.x = X.train,
train.y = y.train,
ranges = list(epsilon = seq(0.1, 0.5, 0.1),
cost = c(1, 10, 100)), # Hyperparameters for Linear kernel
tunecontrol = tune.control(sampling = "cross", cross = 5) # 5-fold cross-validation
)
# Check the best parameters for Linear kernel
print(tune.lin$best.parameters)
epsilon cost
4 0.4 1
# Train the final model using the best parameters for Linear kernel
final.lin <- svm(
X.train, y.train,
type = "eps-regression", # "eps-regression" for SVR
kernel = "linear",
epsilon = tune.lin$best.parameters$epsilon,
cost = tune.lin$best.parameters$cost
)
# Make predictions on the test set
pred.lin <- predict(final.lin, X.test)
# Evaluate performance for Linear kernel
mse.lin <- mean((y.test - pred.lin)^2) # Mean squared error
mae.lin <- mean(abs(y.test - pred.lin)) # Mean absolute error
# Print the performance metrics for Linear kernel
print(paste("MSE for Linear Kernel:", mse.lin))
[1] "MSE for Linear Kernel: 247.173855031737"
print(paste("MAE for Linear Kernel:", mae.lin))
[1] "MAE for Linear Kernel: 10.6972173466993"
# Load necessary libraries
library(MASS) # For stepAIC()
# Fit the initial OLS model (using all predictors)
lse.fit <- lm(glucose ~ ., data = hearts_final)
# Apply stepwise AIC model selection (both directions: forward and backward)
AIC.fit <- stepAIC(lse.fit, direction = "both", trace = FALSE)
# Split the data into features (X) and target (y) (80-20 split as before)
set.seed(123)
train.index <- sample(1:nrow(hearts_final), 0.8 * nrow(hearts_final))
X.train <- hearts_final[train.index, -which(names(hearts_final) == "glucose")]
y.train <- hearts_final$glucose[train.index]
X.test <- hearts_final[-train.index, -which(names(hearts_final) == "glucose")]
y.test <- hearts_final$glucose[-train.index]
# Make predictions on the test set using the stepwise-selected model
pred.lse <- predict(AIC.fit, newdata = X.test)
# Calculate Mean Squared Error (MSE) and Mean Absolute Error (MAE)
mse.lse <- mean((y.test - pred.lse)^2) # Mean squared error
mae.lse <- mean(abs(y.test - pred.lse)) # Mean absolute error
# Print the performance metrics
print(paste("MSE for OLS model:", mse.lse))
[1] "MSE for OLS model: 235.110402500261"
print(paste("MAE for OLS model:", mae.lse))
[1] "MAE for OLS model: 10.7914601298076"
# Diagnostic plots for the final model
par(mfrow = c(2, 2), mar = c(2, 2, 2, 2))
plot(AIC.fit)
The residual plot (top left panel) shows no curve patterns in the data, meaning that it does have a linear regression. Therefore, we do not need to refit the model.
Next, we calculate the predictive errors of the candidate models in the following table.
Performance <- data.frame(RBF.SVR=c(mse.RBF, mae.RBF),
Linear.SVR = c(mse.lin, mae.lin))
row.names(Performance) <- c("MSE", "MAE")
##
pander(Performance)
| RBF.SVR | Linear.SVR | |
|---|---|---|
| MSE | 245.4 | 247.2 |
| MAE | 10.66 | 10.7 |
The above predictive errors show that the linear support vector machine outperforms linear kernel based SVR regression models.
When it comes to answering the question if someone will have coronary heart disease in ten years (TenYearsCHD) we can use a SVM to find a hyperplane that maximizes the margin between the two classes.
# Load necessary libraries
library(e1071) # For svm()
# Two-way data splitting: Train (70%) and Test (30%)
set.seed(123) # For reproducibility
index <- sample(1:nrow(hearts_final), 0.7 * nrow(hearts_final))
train.data <- hearts_final[index, ]
test.data <- hearts_final[-index, ]
# Set up custom cross-validation control (5-fold cross-validation)
tune_control <- tune.control(
cross = 5, # Use 5-fold cross-validation
nrepeat = 1 # Number of repetitions (for repeated cross-validation)
)
# Perform a grid search for the best hyperparameters for the RBF kernel
tune.RBF <- tune(
svm, # SVM algorithm for tuning
TenYearCHD ~ ., # Use TenYearCHD as the target variable
data = train.data,
kernel = "radial", # Radial basis function kernel
ranges = list(
cost = 10^(-1:2), # Tuning hyperparameter C in the loss function
gamma = c(0.1, 0.5, 1, 2) # Hyperparameter gamma for the RBF kernel
),
tunecontrol = tune_control # Use the defined cross-validation settings
)
# Print the tuning results for inspection
# print(tune.RBF)
# Extract the best model and hyperparameters
best.RBF <- tune.RBF$best.model
best.cost.RBF <- best.RBF$cost
best.gamma.RBF <- best.RBF$gamma
# Print the best hyperparameters
cat("Best Cost:", best.cost.RBF, "\n")
Best Cost: 1
cat("Best Gamma:", best.gamma.RBF, "\n")
Best Gamma: 0.1
# Train the final SVM model with the best hyperparameters
final.RBF.class <- svm(
TenYearCHD ~ ., # Target variable: TenYearCHD
data = train.data, # Training data
kernel = "radial", # Radial kernel
cost = best.cost.RBF, # Best cost from tuning
gamma = best.gamma.RBF # Best gamma from tuning
)
# Print the final model for inspection
# print(final.RBF.class)
# Make predictions on the test set
pred.RBF.class <- predict(final.RBF.class, test.data, type = "class")
# Evaluate the model using a confusion matrix
confusion.matrix.RBF <- table(Predicted = pred.RBF.class, Actual = test.data$TenYearCHD)
# Print the confusion matrix
print(confusion.matrix.RBF)
Actual
Predicted 0 1
0 924 169
1 2 3
# Optionally, calculate accuracy or other metrics (e.g., precision, recall)
accuracy <- sum(diag(confusion.matrix.RBF)) / sum(confusion.matrix.RBF)
cat("Accuracy:", accuracy, "\n")
Accuracy: 0.8442623
# Calculate accuracy
accuracy <- sum(diag(confusion.matrix.RBF)) / sum(confusion.matrix.RBF)
cat("\n\n Accuracy:", accuracy, "\n")
Accuracy: 0.8442623
Next, we assess the global performance through ROC analysis. Since ROC analysis is usually used to compare two or more binary classification models, we will build two SVM models with linear and RBF kernels respectively, and the standard binary logistic regression models and then compare the three candidate classification models using ROC and AUC.
# Load necessary libraries
library(e1071) # For svm()
library(pROC) # For ROC curve
library(MASS) # For stepAIC()
# Assuming the 'hearts_final' dataset is already loaded
# Split the data into training (70%) and testing (30%) sets
set.seed(123) # For reproducibility
index <- sample(1:nrow(hearts_final), 0.7 * nrow(hearts_final))
train.data <- hearts_final[index, ]
test.data <- hearts_final[-index, ]
# Set up custom cross-validation control (5-fold cross-validation)
tune.control <- tune.control(
cross = 5, # 5-fold cross-validation
nrepeat = 1 # Number of repetitions (for repeated cross-validation)
)
## Linear SVM Grid Search for Best Hyperparameters
tune.lin <- tune(
svm, # SVM algorithm
TenYearCHD ~ ., # Target variable is 'TenYearCHD'
data = train.data,
kernel = "linear", # Linear kernel
ranges = list(
cost = 10^(-1:2) # Tuning hyperparameter 'C' in the loss function
),
tunecontrol = tune.control # Use custom cross-validation settings
)
# Extract the best model and hyperparameters for Linear SVM
best.lin <- tune.lin$best.model
best.cost.lin <- best.lin$cost
# Train the final Linear SVM model with the best hyperparameters
final.lin <- svm(
TenYearCHD ~ ., # Target variable: 'TenYearCHD'
data = train.data, # Training data
kernel = "linear", # Linear kernel
cost = best.cost.lin, # Best cost from tuning
probability = TRUE # Request probability estimates
)
## Radial SVM Grid Search for Best Hyperparameters
tune.RBF <- tune(
svm, # SVM algorithm
TenYearCHD ~ ., # Target variable is 'TenYearCHD'
data = train.data,
kernel = "radial", # Radial kernel
ranges = list(
cost = 10^(-1:2), # Tuning hyperparameter 'C'
gamma = c(0.1, 0.5, 1, 2) # Tuning 'gamma' for the radial kernel
),
tunecontrol = tune.control # Custom cross-validation settings
)
# Extract the best model and hyperparameters for Radial SVM
best.RBF <- tune.RBF$best.model
best.cost.RBF <- best.RBF$cost
best.gamma.RBF <- best.RBF$gamma
# Train the final Radial SVM model with the best hyperparameters
final.RBF <- svm(
TenYearCHD ~ ., # Target variable: 'TenYearCHD'
data = train.data, # Training data
kernel = "radial", # Radial kernel
cost = best.cost.RBF, # Best cost from tuning
gamma = best.gamma.RBF, # Best gamma from tuning
probability = TRUE # Request probability estimates
)
######################
### Logistic Regression Model
logit.fit <- glm(TenYearCHD ~ ., data = train.data, family = binomial)
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
AIC.logit <- step(logit.fit, direction = "both", trace = 0) # Stepwise selection
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
pred.logit <- predict(AIC.logit, test.data, type = "response")
###################
# ROC Curve and AUC for the models
# Get the predicted probabilities for Linear SVM, Radial SVM, and Logistic Regression
pred.prob.lin <- predict(final.lin, test.data, probability = TRUE)
pred.prob.RBF <- predict(final.RBF, test.data, probability = TRUE)
# Extracting the probabilities for the positive class
prob.linear <- attr(pred.prob.lin, "probabilities")[, 2]
prob.radial <- attr(pred.prob.RBF, "probabilities")[, 2]
# Compute ROC curves
roc_lin <- roc(test.data$TenYearCHD, prob.linear)
roc_RBF <- roc(test.data$TenYearCHD, prob.radial)
roc_logit <- roc(test.data$TenYearCHD, pred.logit)
# Sensitivity and Specificity for each model
lin.sen <- roc_lin$sensitivities
lin.spe <- roc_lin$specificities
rad.sen <- roc_RBF$sensitivities
rad.spe <- roc_RBF$specificities
logit.sen <- roc_logit$sensitivities
logit.spe <- roc_logit$specificities
# AUC values
auc.lin <- roc_lin$auc
auc.rad <- roc_RBF$auc
auc.logit <- roc_logit$auc
# Plot ROC curves
plot(1 - lin.spe, lin.sen,
xlab = "1 - Specificity",
ylab = "Sensitivity",
col = "darkred",
type = "l",
lty = 1,
lwd = 1,
main = "ROC Curves of SVM and Logistic Regression")
lines(1 - rad.spe, rad.sen,
col = "blue",
lty = 1,
lwd = 1)
lines(1 - logit.spe, logit.sen,
col = "orange",
lty = 1,
lwd = 1)
# Add the diagonal line for random guessing
abline(0, 1, col = "skyblue3", lty = 2, lwd = 2)
# Add vertical lines for thresholds
abline(v = c(0.049, 0.151), lty = 3, col = "darkgreen")
# Legend for the plot
legend("bottomright", c("Linear SVM", "Radial SVM", "Logistic Regression"),
lty = c(1, 1, 1), lwd = rep(1, 3),
col = c("red", "blue", "orange"),
bty = "n", cex = 0.8)
# Annotate with AUC values
text(0.8, 0.46, paste("Linear AUC: ", round(auc.lin, 4)), cex = 0.8)
text(0.8, 0.4, paste("Radial AUC: ", round(auc.rad, 4)), cex = 0.8)
text(0.8, 0.34, paste("Logistic AUC: ", round(auc.logit, 4)), cex = 0.8)
The ROC curve above indicates that the linear SVM, RBF SVM, and standard linear logistic regression models do not perform equally well on a global scale. However, one notable observation from the ROC curve is that the sensitivity of both the Radial and logistic regression models is consistently higher than that of the linear SVM when the specificity level is between 85% and 95%. For the question of classification, the best model to use is the logistic regression
Overall in this project we have done some EDA with linear/logistical regularized regression, as well as utilizing SVMs to predict glucose levels and label if someone will have CHD in ten years. We have cross-validated these different models to see which one performed the best in order to answer these key questions.